Abstract
Tatsuoka (1984) suggested several extended caution indices and their standardized versions 
that have been used as person-fit statistics by researchers such as Drasgow, Levine, and 
McLaughlin (1987), Glas and Meijer (2003), and Molenaar and Hoijtink (1990). However, 
these indices are only defined for tests with dichotomous items. This paper extends two of 
the popular standardized extended caution indices (Tatsuoka, 1984) for use with polytomous 
items and mixed-format tests. Two additional new person-fit statistics are obtained by 
applying the asymptotic standardization of person-fit statistics for mixed-format tests 
(Sinharay, 2016c). Detailed simulations are then performed to compute the Type I error 
rate and power of the four new person-fit statistics. Two real data illustrations follow. 
The new person-fit statistics appear to be satisfactory tools for assessing person fit for
polytomous items and mixed-format tests.

Person-fit assessment (PFA) is concerned with uncovering atypical test performance
as reflected in the pattern of scores on individual items in a test (Meijer & Sijtsma, 
2001). In a report for the Council of Chief State School Officers, Olson and Fremer (2013) 
recommended the use of person-fit statistics (PFSs), in addition to other methods, to detect 
irregularities in answering behavior. 

Several PFSs have been proposed in the context of tests with dichotomous items—see 
comprehensive reviews of them by, for example, Karabatsos (2003) and Meijer and Sijtsma 
(2001). There exist fewer PFSs (e.g., Drasgow, Levine, & Williams, 1985; Emons, 2008; 
Glas & Dagohoy, 2007; van Krimpen-Stoop & Meijer, 2002; von Davier & Molenaar, 2003; 
Wright & Masters, 1982) for tests with polytomous items. 

There is a severe lack of research on PFA for mixed-format tests (MFTs), which are
tests that include both dichotomous and polytomous items; Finkelman and Kim (2007),
Sinharay (2015), Sinharay (2016c), and Tendeiro (2017) are some exceptions. Polytomous
items and MFTs promise to become more common because of an increasing emphasis
on constructed-response items in the common-core assessments (e.g., Darling-Hammond
& Adamson, 2010, p. 1). Constructed-response items constitute an integral part of the
assessment design of both of the federally funded assessment consortia, the Smarter
Balanced Assessment Consortium and the Partnership for the Assessment of Readiness
for College and Career (e.g., Lissitz, Hou, & Slater, 2012). Therefore, research on PFA
for polytomous items and MFTs promises to be useful to educational and psychological
measurement.

Tatsuoka (1984) suggested several extended caution indices (ECIs) and their
standardized versions ($ECI_z$) that can be used as PFSs. The standardized versions of the
second ECI (denoted as $ECI2_z$, or $\zeta_1$) and the fourth ECI ($ECI4_z$, or $\zeta_2$) are arguably the
most popular among the caution indices and their extensions, which is evident from the
fact that several researchers (e.g., Drasgow et al., 1987; Glas & Meijer, 2003; Karabatsos,
2003; Molenaar & Hoijtink, 1990) found $\zeta_1$ and $\zeta_2$ to be useful in detecting person misfit.
However, $\zeta_1$ and $\zeta_2$ can be applied only to tests with dichotomous items. Sinharay (2015)
suggested an extension of $\zeta_2$ to MFTs, but the extended PFS had low power.

This paper suggests four new PFSs, including two extensions each of $\zeta_1$ and $\zeta_2$, for use
with polytomous items and MFTs. Two of the new PFSs are based on the asymptotic
standardization/correction of PFSs for MFTs recently suggested by Sinharay (2016c).

The Literature Review section includes reviews of the $\zeta_1$ and $\zeta_2$ statistics for
dichotomous items (Tatsuoka, 1984), the extension of the $\zeta_2$ statistic suggested by Sinharay
(2015), and the asymptotic standardization/correction of PFSs for MFTs (Sinharay, 2016c).
The Methods section includes the descriptions of the new PFSs. The Type I error rate and
power of the new PFSs are examined for several simulated data sets in the Simulations
section. In the Application section, the suggested PFSs are computed for two real data
sets. Conclusions and recommendations are provided in the last section.

The new PFSs are based on item response theory (IRT). Non-parametric PFSs (for
example, Emons, 2008; Sijtsma, 1998; Tendeiro & Meijer, 2014) are not considered in this
paper.


Literature Review 
Review of the $\zeta_1$ and $\zeta_2$ Statistics for Dichotomous Items


Consider a test comprising J dichotomous items that was administered to n examinees.
Let us denote the true ability of examinee i as $\theta_i$. Let $y_{ij}$ be the score (that is, 0 or 1) and
$P_j(\theta_i)$ be the probability that $y_{ij}$ is equal to 1 for examinee i on item j. For example, for
the three-parameter logistic model (3PLM),

$$P_j(\theta_i) = P(y_{ij} = 1) = c_j + (1 - c_j)\,\frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}, \tag{1}$$

where $a_j$, $b_j$, and $c_j$ respectively are the (known) slope, difficulty, and guessing parameters
of item j.
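
As a small illustration of Equation 1, the following sketch evaluates the 3PLM probability for a single item. The function name `p_3plm` and the example parameter values are chosen here for illustration only and are not part of the original study.

```python
import math

def p_3plm(theta, a, b, c):
    """Probability of a correct answer under the 3PLM (Equation 1)."""
    z = a * (theta - b)
    logistic = 1.0 / (1.0 + math.exp(-z))   # equals exp(z) / (1 + exp(z))
    return c + (1.0 - c) * logistic

# Hypothetical item: slope 1, difficulty 0, guessing 0.15, examinee ability 0.
print(round(p_3plm(0.0, 1.0, 0.0, 0.15), 3))
```

The guessing parameter acts as a lower asymptote: as the ability decreases, the returned probability approaches c rather than 0.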


Let us define 


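Let us further define

$$\sigma_j^2(\theta_i) = \mathrm{Var}(y_{ij}) = P_j(\theta_i)[1 - P_j(\theta_i)] \quad \text{and} \quad \bar{P}(\theta_i) = \frac{1}{J}\sum_{j=1}^{J} P_j(\theta_i).$$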


For the i-th examinee, Tatsuoka (1984) defined the second standardized ECI, denoted
as $ECI2_z$, or $\zeta_1$, as

$$\zeta_1 = \frac{\sum_{j=1}^{J} [P_j(\theta_i) - y_{ij}](G_j - G)}{\sqrt{\sum_{j=1}^{J} \sigma_j^2(\theta_i)(G_j - G)^2}}, \tag{3}$$

and the fourth standardized ECI, denoted as $ECI4_z$, or $\zeta_2$, as

$$\zeta_2 = \frac{\sum_{j=1}^{J} [P_j(\theta_i) - y_{ij}][P_j(\theta_i) - \bar{P}(\theta_i)]}{\sqrt{\sum_{j=1}^{J} \sigma_j^2(\theta_i)[P_j(\theta_i) - \bar{P}(\theta_i)]^2}}. \tag{4}$$

Note that the denominator in each of Equations 3 and 4 is the standard deviation (SD)
of the numerator. Note also that the $ECI6_z$ statistic of Tatsuoka (1984) is identical to
$ECI4_z$, so $\zeta_2$ denotes both $ECI4_z$ and $ECI6_z$. To compute $\zeta_1$ and $\zeta_2$ for a data set, $\theta_i$
has to be replaced by an estimate $\hat{\theta}_i$.

Tatsuoka (1984) assumed the asymptotic null distribution of $\zeta_1$ or $\zeta_2$, with $\theta_i$ replaced
by $\hat{\theta}_i$, to be standard normal. In addition, each of $\zeta_1$ and $\zeta_2$ becomes larger as an examinee
answers more difficult items correctly¹ or answers more easy items incorrectly, which
usually happens when person misfit occurs. Therefore, a large value of either $\zeta_1$ or $\zeta_2$
(for example, a value larger than 1.645 at the 5% level) indicates person misfit.
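
The two statistics can be sketched in a few lines of code for a single examinee. The sketch below follows the verbal description of Equations 3 and 4 (a numerator divided by its SD). It takes the item-level quantities $G_j$ as given inputs and assumes that $G$ is their average over items; that averaging, the function name `caution_indices`, and the example values are assumptions made for illustration.

```python
import math

def caution_indices(y, p, g):
    """Return (zeta1, zeta2) of Equations 3 and 4 for one examinee.

    y: 0/1 item scores; p: P_j(theta_i) for each item;
    g: the item-level quantities G_j, supplied by the caller.
    """
    n_items = len(y)
    g_bar = sum(g) / n_items          # assumption: G is the mean of the G_j
    p_bar = sum(p) / n_items          # P-bar(theta_i)
    var = [pj * (1.0 - pj) for pj in p]
    num1 = sum((pj - yj) * (gj - g_bar) for yj, pj, gj in zip(y, p, g))
    den1 = math.sqrt(sum(v * (gj - g_bar) ** 2 for v, gj in zip(var, g)))
    num2 = sum((pj - yj) * (pj - p_bar) for yj, pj in zip(y, p))
    den2 = math.sqrt(sum(v * (pj - p_bar) ** 2 for v, pj in zip(var, p)))
    return num1 / den1, num2 / den2

# Hypothetical 4-item test ordered from easy to hard.
p = [0.9, 0.7, 0.5, 0.3]
aberrant = caution_indices([0, 0, 1, 1], p, p)   # easy items wrong, hard right
typical = caution_indices([1, 1, 0, 0], p, p)    # easy items right, hard wrong
```

In this toy example the aberrant pattern yields positive values and the typical pattern yields negative values, in line with the interpretation that large positive values indicate misfit.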

Both $\zeta_1$ and $\zeta_2$ have been used as PFSs by several researchers such as Drasgow et al.
(1987), Glas and Meijer (2003), Karabatsos (2003), Li and Olejnik (1997), and Molenaar
and Hoijtink (1990). The $\zeta_2$ statistic along with the (Bayesian) posterior predictive model
checking method performed the best overall among eight PFSs in Glas and Meijer (2003).
Drasgow et al. (1987) found $\zeta_1$ and $\zeta_2$ to have satisfactory power in a comparison of several
PFSs. Sinharay (2016b) used the results of Snijders (2001) to suggest asymptotically
standardized versions of $\zeta_1$ and $\zeta_2$ for dichotomous items and found those standardized
versions to have satisfactory Type I error rate and power. However, $\zeta_1$ and $\zeta_2$ are defined
only for dichotomous items. Therefore, keeping in mind the increasing importance of
polytomous items and MFTs, extensions of $\zeta_1$ and $\zeta_2$ to polytomous items and to MFTs
may be helpful to researchers interested in PFA.


¹ For a difficult item, $G_j - G < 0$ and $P_j(\theta_i) - \bar{P}(\theta_i) < 0$. A correct answer on the item means $P_j(\theta_i) - y_{ij} < 0$,
so that both $(P_j(\theta_i) - y_{ij})(G_j - G)$ and $(P_j(\theta_i) - y_{ij})[P_j(\theta_i) - \bar{P}(\theta_i)]$ are positive.
Notation for Mixed-format Tests


Consider a test with J items, each of which can be dichotomous or polytomous.
Consider the i-th examinee whose true ability is $\theta_i$. The examinee's score on item j, $y_{ij}$,
can be an integer between 0 and $m_j$. Let us denote, for $k = 0, 1, \ldots, m_j$,

$$d_k(y_{ij}) = \begin{cases} 1 & \text{if } y_{ij} = k, \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

Let

$$P_{jk}(\theta_i) = P(y_{ij} = k) = P(d_k(y_{ij}) = 1) = E(d_k(y_{ij})) \tag{6}$$

denote the probability that the score of examinee i on item j is equal to k. Equations 5 and
6 subsume items modeled by any common IRT model for appropriate choices of $P_{jk}(\theta_i)$.
For example, if item j is polytomous and if the generalized partial credit model (GPCM;
Muraki, 1992) is used for the item, then Equations 5 and 6 subsume the item (and the
GPCM) for

$$P_{jk}(\theta_i) = \frac{\exp\left[\sum_{h=0}^{k} a_j(\theta_i - b_{jh})\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{h=0}^{c} a_j(\theta_i - b_{jh})\right]},$$

where the $a_j$'s are the slope parameters and the $b_{jh}$'s are the location parameters. If item j is
dichotomous, then $m_j$ becomes equal to 1, and, as a result, $y_{ij}$ can be 0 or 1; then, for
example, if the 3PLM is used for the item, Equations 5 and 6 subsume the item (and
the 3PLM) for

$$d_0(y_{ij}) = 1 - y_{ij}, \quad d_1(y_{ij}) = y_{ij}, \quad P_{j0}(\theta_i) = 1 - P_j(\theta_i), \quad \text{and} \quad P_{j1}(\theta_i) = P_j(\theta_i), \tag{7}$$

where $P_j(\theta_i)$ was defined in Equation 1.

Note that MFTs include as special cases tests with only dichotomous items, tests with
only polytomous items with the same number of response categories, and tests with only
polytomous items with varying numbers of response categories (because $m_j$ is allowed to
vary over j).
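
The notation above can be made concrete with a short sketch that returns the category probabilities of a GPCM item and the indicators of Equation 5. The function names `gpcm_probs` and `indicators` are hypothetical. The sketch relies on the observation that the h = 0 term of the GPCM sum appears in every category and therefore cancels, so it is fixed at 0 here.

```python
import math

def gpcm_probs(theta, a, b):
    """Category probabilities P_j0, ..., P_jm under the GPCM.

    b holds the location parameters b_j1, ..., b_jm. The h = 0 term is
    common to every category and cancels, so it is fixed at 0 here.
    """
    z = [0.0]
    for b_h in b:
        z.append(z[-1] + a * (theta - b_h))   # running sum of a * (theta - b_h)
    top = max(z)                              # subtract the max to avoid overflow
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def indicators(y, m):
    """d_k(y) of Equation 5 for k = 0, ..., m."""
    return [1 if y == k else 0 for k in range(m + 1)]

# Hypothetical three-category item with locations -1 and 1.
probs = gpcm_probs(0.3, 1.2, [-1.0, 1.0])
```

With a single location parameter, `gpcm_probs` returns the two category probabilities of a two-parameter logistic item, which mirrors the way dichotomous items arise as the special case $m_j = 1$.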


Review of the Extension of $\zeta_2$ to MFTs (Sinharay, 2015)


Sinharay (2015) suggested the following extended version of $\zeta_2$ for use with MFTs:


oe a E et a ae ss u(6,)| 


mj 


9 ? 
(8)


where U(0;) = 5 eye = etl ) The statistic in Equation 8 was found to have considerably smaller
power (often by a margin of 0.10 under some conditions) in the simulations in this paper
compared to the extensions suggested later, so it is not considered in the remainder of this
paper.


Review of the Asymptotic Standardization of PFSs for MFTs 


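Sinharay (2016c) considered, for MFTs, only those PFSs that are of the form

$$\frac{T(\theta_i)}{\sqrt{\mathrm{Var}(T(\theta_i))}} \tag{9}$$

for

$$T(\theta_i) = \sum_{j=1}^{J}\sum_{k=0}^{m_j} [d_k(y_{ij}) - P_{jk}(\theta_i)]\, w_{jk}(\theta_i), \tag{10}$$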


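where $w_{jk}(\theta_i)$ is a real-valued weight function. For example, the $l_z$ statistic for polytomous
items (Drasgow et al., 1985) is of the form given in Equation 9 for $w_{jk}(\theta_i) = \log P_{jk}(\theta_i)$.
The variance $\mathrm{Var}(T(\theta_i))$ is equal to $\sum_{j=1}^{J} w_j'(\theta_i) D_j(\theta_i) w_j(\theta_i)$, where

$$w_j(\theta_i) = (w_{j0}(\theta_i), w_{j1}(\theta_i), \ldots, w_{jm_j}(\theta_i))'$$

and $D_j(\theta_i)$ is the covariance matrix of $(d_0(y_{ij}), d_1(y_{ij}), \ldots, d_{m_j}(y_{ij}))$.

The covariance matrix $D_j(\theta_i)$ is given by

$$D_j(\theta_i) = \begin{pmatrix}
P_{j0}(\theta_i)(1 - P_{j0}(\theta_i)) & -P_{j0}(\theta_i)P_{j1}(\theta_i) & \cdots & -P_{j0}(\theta_i)P_{jm_j}(\theta_i) \\
-P_{j1}(\theta_i)P_{j0}(\theta_i) & P_{j1}(\theta_i)(1 - P_{j1}(\theta_i)) & \cdots & -P_{j1}(\theta_i)P_{jm_j}(\theta_i) \\
\vdots & \vdots & \ddots & \vdots \\
-P_{jm_j}(\theta_i)P_{j0}(\theta_i) & -P_{jm_j}(\theta_i)P_{j1}(\theta_i) & \cdots & P_{jm_j}(\theta_i)(1 - P_{jm_j}(\theta_i))
\end{pmatrix}.$$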
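
The covariance matrix of the category indicators has a simple structure, so the quadratic form that enters the variance can be computed without building the matrix. The sketch below shows both routes for one item; the function names `category_cov` and `quad_form` and the example numbers are illustrative assumptions.

```python
def category_cov(p):
    """Covariance matrix D_j of the indicators (d_0, ..., d_m) for one item."""
    size = len(p)
    return [[p[k] * (1.0 - p[k]) if k == c else -p[k] * p[c]
             for c in range(size)] for k in range(size)]

def quad_form(w, p):
    """w' D_j w computed without forming D_j: sum(p w^2) - (sum(p w))^2."""
    mean = sum(pk * wk for pk, wk in zip(p, w))
    return sum(pk * wk * wk for pk, wk in zip(p, w)) - mean * mean

# Hypothetical three-category item and weights.
p = [0.2, 0.5, 0.3]
w = [1.0, -2.0, 0.5]
D = category_cov(p)
direct = sum(w[k] * D[k][c] * w[c] for k in range(3) for c in range(3))
```

The shortcut works because $w_j' D_j w_j$ is the variance of the weight attached to the observed category, that is, the expected squared weight minus the squared expected weight.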


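Sinharay (2016c) showed that the asymptotic null distribution of the PFS in Equation 9,
with $\theta_i$ replaced by an estimate $\hat{\theta}_i$, is not $N(0, 1)$, but that of

$$\frac{T(\hat{\theta}_i) + c_n(\hat{\theta}_i)\, s_0(\hat{\theta}_i)}{\sqrt{\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))}} \tag{11}$$

is $N(0, 1)$ under three regularity conditions, where

$$\widetilde{\mathrm{Var}}(T(\hat{\theta}_i)) = \sum_{j=1}^{J} \tilde{w}_j'(\hat{\theta}_i)\, D_j(\hat{\theta}_i)\, \tilde{w}_j(\hat{\theta}_i) \tag{12}$$

for $\tilde{w}_j(\hat{\theta}_i) = (\tilde{w}_{j0}(\hat{\theta}_i), \tilde{w}_{j1}(\hat{\theta}_i), \ldots, \tilde{w}_{jm_j}(\hat{\theta}_i))'$;
that is, $\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))$ is obtained by replacing $w_{jk}(\theta_i)$ in the expression of $\mathrm{Var}(T(\theta_i))$ by
$\tilde{w}_{jk}(\hat{\theta}_i)$, where

$$\tilde{w}_{jk}(\hat{\theta}_i) = w_{jk}(\hat{\theta}_i) - c_n(\hat{\theta}_i)\, s_{jk}(\hat{\theta}_i), \tag{13}$$

$$c_n(\hat{\theta}_i) = \frac{\sum_{j=1}^{J}\sum_{k=0}^{m_j} P'_{jk}(\hat{\theta}_i)\, w_{jk}(\hat{\theta}_i)}{\sum_{j=1}^{J}\sum_{k=0}^{m_j} P'_{jk}(\hat{\theta}_i)\, s_{jk}(\hat{\theta}_i)}, \tag{14}$$

and $\hat{\theta}_i$ satisfies

$$s_0(\hat{\theta}_i) + \sum_{j=1}^{J}\sum_{k=0}^{m_j} [d_k(y_{ij}) - P_{jk}(\hat{\theta}_i)]\, s_{jk}(\hat{\theta}_i) = 0 \tag{15}$$

for some functions $s_0(\theta_i)$ and $s_{jk}(\theta_i)$, where

$$P'_{jk}(\theta_i) = \text{the first derivative of } P_{jk}(\theta_i) \text{ with respect to } \theta_i. \tag{16}$$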


Sinharay (2016c) suggested the $l_z^*$ statistic for MFTs, which is an extension of the $l_z^*$
statistic for dichotomous items (Snijders, 2001) and is of the form given by Equation 11 for
$w_{jk}(\theta_i) = \log P_{jk}(\theta_i)$. Expressions of $P'_{jk}(\theta_i)$ for the common IRT models such as the 3PLM
and GPCM can be found in, for example, Tao, Shi, and Chang (2012). The condition
provided in Equation 15 is satisfied by the maximum likelihood estimate (MLE), weighted
maximum likelihood estimate (WLE; Warm, 1989), and modal a posteriori (MAP) estimate
of ability, and, for all these estimates,

$$s_{jk}(\hat{\theta}_i) = \frac{P'_{jk}(\hat{\theta}_i)}{P_{jk}(\hat{\theta}_i)}. \tag{17}$$

The quantity $s_0(\hat{\theta}_i)$ can be computed as

$$s_0(\hat{\theta}_i) = \begin{cases}
0 & \text{if } \hat{\theta}_i = \text{MLE}, \\
\dfrac{f'(\hat{\theta}_i)}{f(\hat{\theta}_i)} & \text{if } \hat{\theta}_i = \text{MAP}, \\
\dfrac{J(\hat{\theta}_i)}{2 I(\hat{\theta}_i)} & \text{if } \hat{\theta}_i = \text{WLE},
\end{cases} \tag{18}$$

where $f(\theta_i)$ is the prior distribution on $\theta_i$, $I(\theta_i)$ is the information on $\theta_i$, and
$J(\theta_i) = \sum_{j=1}^{J}\sum_{k=0}^{m_j} P'_{jk}(\theta_i) P''_{jk}(\theta_i) / P_{jk}(\theta_i)$, where $P''_{jk}(\theta_i)$ is the second derivative of $P_{jk}(\theta_i)$. The results
of Sinharay (2016c) are extensions of similar results suggested for dichotomous items
by Snijders (2001), who applied his results to derive the standardized $l_z^*$ statistic from
the unstandardized $l_z$ statistic (Drasgow et al., 1985) for dichotomous items. The three
regularity conditions of Sinharay (2016c) are satisfied by all common IRT models for
MFTs including combinations of the 1-, 2-, and 3-parameter logistic and probit models,
the GPCM, the partial credit model (Masters, 1982), and the graded response model
(Samejima, 1969). Tendeiro (2017) applied the extension of Sinharay (2016c) to suggest a
PFS that can be applied to unfolding models.
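
A minimal sketch of the correction reviewed above is given next, for one examinee and an arbitrary weight function. It assumes that the ability estimate is the MLE (so that the default $s_0 = 0$ applies) and that $s_{jk} = P'_{jk}/P_{jk}$, which is the form implied by the likelihood equation; the function names and the toy numbers are hypothetical and do not come from the original programs.

```python
import math

def flatten(nested):
    """Concatenate the per-item lists into one list over all (j, k)."""
    return [v for item in nested for v in item]

def corrected_pfs(d, p, dp, w, s0=0.0):
    """Statistic of Equation 11 for one examinee.

    d, p, dp, w: one list per item holding, for k = 0, ..., m_j, the
    indicators d_k(y_ij), P_jk, P'_jk, and weights w_jk, all evaluated
    at the ability estimate. s0 defaults to 0, the value for the MLE.
    """
    # Assumption: s_jk = P'_jk / P_jk, as implied by the likelihood equation.
    s = [[dpk / pk for dpk, pk in zip(dpj, pj)] for dpj, pj in zip(dp, p)]
    fd, fp, fdp, fw, fs = map(flatten, (d, p, dp, w, s))
    c_n = (sum(x * y for x, y in zip(fdp, fw)) /
           sum(x * y for x, y in zip(fdp, fs)))                  # Equation 14
    t = sum((dk - pk) * wk for dk, pk, wk in zip(fd, fp, fw))    # Equation 10
    var = 0.0
    for pj, wj, sj in zip(p, w, s):
        wt = [wk - c_n * sk for wk, sk in zip(wj, sj)]           # Equation 13
        mean = sum(pk * v for pk, v in zip(pj, wt))
        var += sum(pk * v * v for pk, v in zip(pj, wt)) - mean * mean  # Eq. 12
    return (t + c_n * s0) / math.sqrt(var)                       # Equation 11

# Toy example: three dichotomous items, log-probability weights.
p = [[0.3, 0.7], [0.5, 0.5], [0.8, 0.2]]
dp = [[-0.21, 0.21], [-0.25, 0.25], [-0.16, 0.16]]
d = [[0, 1], [1, 0], [1, 0]]
w = [[math.log(v) for v in pj] for pj in p]
z = corrected_pfs(d, p, dp, w)
```

One property that is convenient for checking such an implementation is that adding a constant to all the weights of an item leaves the statistic unchanged, because the indicators and the probabilities of an item each sum to 1 and the derivatives sum to 0.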


Methods 
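Extensions of $\zeta_1$ and $\zeta_2$ to Mixed-format Tests
Let us define, for each score category k,

$$G_{jk} = \frac{1}{n}\sum_{i=1}^{n} P_{jk}(\theta_i) \quad \text{and} \quad G_k = \frac{1}{J_k}\sum_{j \in S_k} G_{jk},$$

where $S_k$ is the set of items that have a score category of k and $J_k$ is the size of $S_k$. Let us
define

$$\bar{P}_k(\theta_i) = \frac{1}{J_k}\sum_{j \in S_k} P_{jk}(\theta_i).$$

Usually, $J_k$ would be smaller than J. If all the items have the same number of score
categories, then $J_k = J$ and $S_k$ is the set of all items.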
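Then, for examinee i, an extended version of $\zeta_1$ for use with MFTs can be obtained as

$$\zeta_1 = \frac{\sum_{j=1}^{J}\sum_{k=0}^{m_j} [P_{jk}(\theta_i) - d_k(y_{ij})][G_{jk} - G_k]}{\sqrt{\sum_{j=1}^{J} f_j'(\theta_i)\, D_j(\theta_i)\, f_j(\theta_i)}}, \tag{19}$$

where $f_j(\theta_i) = (G_{j0} - G_0, G_{j1} - G_1, \ldots, G_{jm_j} - G_{m_j})'$,

and an extended version of $\zeta_2$ for use with MFTs can be obtained as

$$\zeta_2 = \frac{\sum_{j=1}^{J}\sum_{k=0}^{m_j} [P_{jk}(\theta_i) - d_k(y_{ij})][P_{jk}(\theta_i) - \bar{P}_k(\theta_i)]}{\sqrt{\sum_{j=1}^{J} g_j'(\theta_i)\, D_j(\theta_i)\, g_j(\theta_i)}}, \tag{20}$$

where $g_j(\theta_i) = (P_{j0}(\theta_i) - \bar{P}_0(\theta_i), P_{j1}(\theta_i) - \bar{P}_1(\theta_i), \ldots, P_{jm_j}(\theta_i) - \bar{P}_{m_j}(\theta_i))'$.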


A comparison of Equations 3 and 19 (or of Equations 4 and 20) shows that

• the probability of a correct answer in Equation 3 (or 4), $P_j(\theta_i)$, is replaced in Equation 19
(or 20) by $P_{jk}(\theta_i)$, the probability of a score equal to a specific score category.

• the binary item score $y_{ij}$ in Equation 3 (or 4) is replaced in Equation 19 (or 20) by the
binary category score indicator $d_k(y_{ij})$.

• $G$ in Equation 3 is replaced by $G_k$ in Equation 19 and $\bar{P}(\theta_i)$ in Equation 4 is replaced
by $\bar{P}_k(\theta_i)$ in Equation 20.

• As in Equation 3 (or 4), the denominator of Equation 19 (or 20) is the SD of the
numerator.

Thus, the extended versions of $\zeta_1$ and $\zeta_2$ for use with MFTs capture person misfit in
the same manner in which $\zeta_1$ and $\zeta_2$ capture misfit for dichotomous items, but involve
appropriate adjustments for polytomous items. The value of each extended version
of $\zeta_1$ and $\zeta_2$ is expected to be large and positive in the presence of person misfit.

A comparison of Equations 8 and 20 makes it clear that the former does not involve a
summation over each possible score category of the items whereas the latter does. Thus,
the statistic provided in Equation 8 involves the loss of some information, which may be
the reason for its smaller power compared to that provided in Equation 20.

To compute the above extended versions of $\zeta_1$ and $\zeta_2$ for a data set, $\theta_i$ has to be replaced
by an estimate $\hat{\theta}_i$. If all the items on the test are dichotomous, then the right-hand side of
Equation 19 becomes equal to that of Equation 3 and the right-hand side of Equation 20
becomes equal to that of Equation 4; a proof, which involves some algebra, is omitted, but
it is similar to the proof of Theorem 1 in Sinharay (2016c) and can be obtained
upon request from the author.
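
The extended version of $\zeta_2$ described above can be sketched as follows for one examinee. The sketch assumes that $\bar{P}_k$ is the average of $P_{jk}$ over the items that have score category k, and it uses the fact that $g_j' D_j g_j$ equals the variance of the weight attached to the observed category; the function name `extended_zeta2` and the example values are hypothetical.

```python
import math

def extended_zeta2(y, probs):
    """Extended zeta_2 (Equation 20) for one examinee.

    y: item scores y_ij; probs: one list per item with P_j0, ..., P_jm_j
    evaluated at the examinee's ability (or its estimate).
    """
    max_m = max(len(p) for p in probs) - 1
    # Assumption: P-bar_k averages P_jk over S_k, the items with category k.
    p_bar = []
    for k in range(max_m + 1):
        s_k = [p[k] for p in probs if len(p) > k]
        p_bar.append(sum(s_k) / len(s_k))
    num, var = 0.0, 0.0
    for y_j, p in zip(y, probs):
        g = [p[k] - p_bar[k] for k in range(len(p))]
        d = [1 if y_j == k else 0 for k in range(len(p))]
        num += sum((p[k] - d[k]) * g[k] for k in range(len(p)))
        mean = sum(pk * gk for pk, gk in zip(p, g))
        var += sum(pk * gk * gk for pk, gk in zip(p, g)) - mean * mean
    return num / math.sqrt(var)

# Hypothetical mixed-format test: two dichotomous items and one 3-category item.
z_mixed = extended_zeta2([1, 0, 2], [[0.4, 0.6], [0.7, 0.3], [0.2, 0.3, 0.5]])
```

When every item is dichotomous, both the numerator and the SD in this sketch are exactly twice their dichotomous counterparts, so the ratio agrees with the dichotomous $\zeta_2$; this gives a quick numerical check of the reduction stated in the text.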


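Asymptotic Standardization of the Suggested Extensions
The extended versions of $\zeta_1$ (Equation 19) and $\zeta_2$ (Equation 20) are both special cases
of the PFS that was considered in Sinharay (2016c) and can be expressed using Equation 9
for

$$w_{jk}(\theta_i) = -(G_{jk} - G_k) \quad \text{and} \quad w_{jk}(\theta_i) = -(P_{jk}(\theta_i) - \bar{P}_k(\theta_i)), \tag{21}$$

respectively. Thus, when $\theta_i$ is replaced by its MLE, WLE, or MAP, denoted as $\hat{\theta}_i$, the
asymptotic standardization/correction suggested by Sinharay (2016c) can be applied to
either of $\zeta_1$ and $\zeta_2$ to obtain the corresponding asymptotically standardized/corrected versions.
Let us denote these standardized/corrected versions as $\zeta_1^*$ and $\zeta_2^*$, respectively, that is,

$$\zeta_1^* = \frac{T(\hat{\theta}_i) + c_n(\hat{\theta}_i)\, s_0(\hat{\theta}_i)}{\sqrt{\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))}}, \text{ where } w_{jk}(\hat{\theta}_i) = -(G_{jk} - G_k), \tag{22}$$

$$\zeta_2^* = \frac{T(\hat{\theta}_i) + c_n(\hat{\theta}_i)\, s_0(\hat{\theta}_i)}{\sqrt{\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))}}, \text{ where } w_{jk}(\hat{\theta}_i) = -(P_{jk}(\hat{\theta}_i) - \bar{P}_k(\hat{\theta}_i)), \tag{23}$$

and $T(\hat{\theta}_i)$, $\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))$, $c_n(\hat{\theta}_i)$, and $s_0(\hat{\theta}_i)$ are defined in Equations 10, 12, 14, and 18,
respectively.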


Asymptotic Null Distributions of $\zeta_1^*$ and $\zeta_2^*$


Sinharay (2016c) proved in the context of MFTs that if a PFS is of the form given
by Equation 9 for some $w_{jk}(\theta_i)$, then the asymptotic null distribution of its standardized
version provided in Equation 11 is $N(0,1)$ under three regularity conditions. Because both
$\zeta_1$ and $\zeta_2$ for MFTs are of the form given by Equation 9 for the $w_{jk}(\theta_i)$'s given by Equation 21,
the proof of Sinharay (2016c) implies that $\zeta_1$ and $\zeta_2$ do not have a standard normal
asymptotic null distribution, but that their standardized versions $\zeta_1^*$ and $\zeta_2^*$ provided in
Equations 22 and 23 have a standard normal asymptotic null distribution under three
regularity conditions. The regularity conditions are mild and satisfied by the IRT models
commonly used for MFTs.

A test with only polytomous items with the same number of response categories for all
items or a test with only polytomous items with a varying number of response categories
is a special case of a MFT; thus, the suggested PFSs $\zeta_1^*$ and $\zeta_2^*$ have a standard normal
asymptotic null distribution for such tests as well. If a test includes only dichotomous items,
then the suggested $\zeta_1^*$ and $\zeta_2^*$ statistics become identical to the corresponding asymptotically
standardized versions (for dichotomous items) suggested in Sinharay (2016b) and have a
standard normal asymptotic null distribution. Sinharay (2016c) showed that the PFSs
obtained by removing the term $c_n(\hat{\theta}_i) s_0(\hat{\theta}_i)$ from the numerator of Equation 11 also have a
standard normal asymptotic null distribution; however, the term $c_n(\hat{\theta}_i) s_0(\hat{\theta}_i)$ is included in
the remainder of this paper.


Computations

Given the data, an IRT model, and the estimated item parameters, the computation of
$\zeta_1^*$ and $\zeta_2^*$ for an examinee involves the computation of the quantities/expressions provided
in the left column of Table 1, in the same order as that of the rows of the table, using the
formulas provided in the right column of the table.


A Simulation Study 


The Type I error rate and power of $\zeta_1$, $\zeta_2$, $\zeta_1^*$, and $\zeta_2^*$ were examined for a variety of
simulated MFTs. The simulation study also included the $l_z^*$ statistic (Sinharay, 2016c), which
is another asymptotically corrected PFS (like $\zeta_1^*$ and $\zeta_2^*$) for use with MFTs.


Design of the Simulation 


The simulation study involved three levels of test length (12 items, 30 items, and 60
items) that represented short, moderate, and long tests. Each generated data set involved
two sets of items, a set of dichotomous items and a set of polytomous items, which resulted
in the data set being like one arising from a MFT. The number of polytomous items was
4, 10, and 20, respectively (that is, one-third of the items), for the three test lengths. The number of
response categories for each polytomous item was fixed at three, with possible scores being 0,
1, and 2. Scores on dichotomous and polytomous items were generated using the 3PLM and
GPCM, respectively. The true slope parameters of all items were generated, as in Glas and
Dagohoy (2007), from a log-normal distribution with 0 and 0.25, respectively, as the mean
and SD of the logarithm of the variable. The true difficulty and true guessing parameters for
the dichotomous items were generated from a N(0,1) and a Uniform(0.05,0.3) distribution,
respectively. The true location parameters of the polytomous items were generated from
N(-1,0.5) and N(1,0.5) distributions, respectively, as in Chon, Lee, and Dunbar (2010).

Table 1. The quantities that need to be computed in order to compute $\zeta_1^*$ and $\zeta_2^*$.

Compute                                              Using
$\hat{\theta}_i$ (MLE, WLE, or MAP)                  The scores, the item parameters, and a maximization
                                                     algorithm such as the Newton-Raphson algorithm
$d_k(y_{ij})$                                        Equation 5
$P_{jk}(\hat{\theta}_i)$                             Equation 6
$s_{jk}(\hat{\theta}_i)$                             Equation 17
$s_0(\hat{\theta}_i)$                                Equation 18
$w_{jk}(\hat{\theta}_i)$                             Equation 21
$P'_{jk}(\hat{\theta}_i)$                            Equation 16
$c_n(\hat{\theta}_i)$                                Equation 14
$\tilde{w}_{jk}(\hat{\theta}_i)$                     Equation 13
$T(\hat{\theta}_i)$                                  Equation 10
$\widetilde{\mathrm{Var}}(T(\hat{\theta}_i))$        Equation 12
$\zeta_1^*$ and $\zeta_2^*$                          Equations 22 and 23

To compute the Type I error of the PFSs, score patterns that fit the IRT
(3PLM+GPCM) model were generated. To compute the power of the PFSs, score patterns
that are “corrupted” and do not fit the IRT model were generated in several ways. The item
parameters are assumed known; because of this assumption, the power does not depend on
the number of examinees in a data set whose score patterns were corrupted, so the score
patterns of all examinees were corrupted in each data set used to compute power.



As in other simulation studies on PFA (e.g., Glas & Meijer, 2003; Sinharay, 2016c; van
Krimpen-Stoop & Meijer, 2002), corrupted score patterns reflected “lack of motivation”
or “item disclosure/preknowledge”. When “lack of motivation” was simulated, the score
patterns of all examinees involved lack of motivation on 1/6 or 1/3 of all items. It was assumed,
as in, for example, Glas and Meijer (2003), that the dichotomous items on which an
examinee lacks motivation are the easiest among all the dichotomous items. The probability
of a correct response to a dichotomous item on which an examinee lacks motivation was set
to 0.2 as in Glas and Meijer (2003). For a polytomous item under “lack of motivation”, 2.5
was subtracted from the examinee ability before generating a score on the item; it was
found from a preliminary simulation that this reduction of 2.5 was roughly equivalent
on average to setting the probability of a correct answer on a dichotomous item to 0.2.²
When “item disclosure” was simulated, the score patterns of all examinees involved the
assumption that 1/6 or 1/3 of all items were disclosed to the examinee. It was assumed, as in
Glas and Meijer (2003), that the dichotomous items on which item disclosure occurs are the
most difficult among all the dichotomous items. The score on a disclosed item (dichotomous
or polytomous) was set equal to $m_j$, the highest possible score on the item.
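
The item-disclosure corruption for dichotomous items can be sketched as below. The helper `disclose_hardest` is hypothetical; it covers only the rule stated for dichotomous items (the most difficult items are the disclosed ones) and takes the number of disclosed items as an input, because how the disclosed fraction was split between dichotomous and polytomous items is not stated here.

```python
def disclose_hardest(scores, difficulties, n_disclosed, max_score=1):
    """Set the scores on the n_disclosed most difficult items to max_score."""
    # Item indices sorted from the most difficult to the easiest item.
    order = sorted(range(len(scores)), key=lambda j: difficulties[j],
                   reverse=True)
    corrupted = list(scores)          # leave the input pattern untouched
    for j in order[:n_disclosed]:
        corrupted[j] = max_score
    return corrupted

# Hypothetical 4-item pattern in which the two hardest items are disclosed.
pattern = disclose_hardest([0, 1, 0, 0], [0.5, -1.0, 2.0, 1.2], 2)
```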

For each simulation condition (where an example of a simulation condition is “12
items and lack of motivation on 1/6 of the items”), 1,000 data sets with 1,000 examinees each
were simulated; the true item parameters were simulated once for each of the 1,000 data
sets. The true $\theta_i$'s of the examinees were drawn uniformly from the following nine
values: -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, and 2.0. This design, which was also used by
researchers such as Snijders (2001) and van Krimpen-Stoop and Meijer (1999), allows the
computation of the Type I error rate and power at these $\theta$-values accurately. The Type I
error rates at levels of 1% and 5% for a PFS at a specific $\theta$-value for a test length were
computed as the proportion of score patterns fitting the IRT model under that test length
with that $\theta$-value for which the PFS was statistically significant (under a standard normal
null distribution assumption). Thus, the Type I error rate at any of the nine values of $\theta_i$ is
computed from about 111,111 ($\approx 1{,}000 \times 1{,}000/9$) examinees for any given test length. The
standard error is approximately 0.0003 when the Type I error is close to 0.01 and 0.0007
when the Type I error is close to 0.05; that is because, for example, when the Type I error is
close to 0.01, the corresponding standard error is equal to $\sqrt{(0.01 \times 0.99)/111111} \approx 0.0003$.

² Consider a dichotomous item with 1, 0, and 0.15 as the estimated slope, difficulty, and guessing
parameters; these values are all close to the average of the generating item parameters, so the item can be
considered as an average item. The probability of a correct answer on the item is 0.58 for ability 0 (average
ability). To make the probability of a correct answer equal to 0.2, the ability has to be smaller than -2.6.

The power at the 5% level for a PFS at a $\theta$-value for a simulation condition was computed
as the proportion of the corrupted score patterns under that simulation condition with
that $\theta$-value for which the PFS was statistically significant (under a standard normal null
distribution assumption) at that level. Thus, the power at each ability was computed from
about 111,111 ($\approx 1{,}000 \times 1{,}000/9$) examinees in any simulation condition. The standard
error for any value of power is always smaller than 0.0015 (the maximum occurring near
power values of 0.5).
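
The rejection proportions and their standard errors described above follow from the binomial formula, as the sketch below illustrates. The function names `rejection_rate` and `standard_error` are hypothetical; the critical value is the upper-tail percentile of the standard normal distribution, which is about 1.645 at the 5% level.

```python
import math
from statistics import NormalDist

def rejection_rate(values, level=0.05):
    """Proportion of PFS values above the upper-tail normal critical value."""
    crit = NormalDist().inv_cdf(1.0 - level)
    return sum(v > crit for v in values) / len(values)

def standard_error(rate, n=1000 * 1000 / 9):
    """Binomial standard error of a proportion based on about 111,111 patterns."""
    return math.sqrt(rate * (1.0 - rate) / n)

for rate in (0.01, 0.05, 0.5):
    print(rate, round(standard_error(rate), 4))
```

The same `rejection_rate` function applies to both Type I error and power; only the score patterns fed to it (model-fitting or corrupted) differ.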


Computations 


Fortran 90 programs written by the author were used for the computation of the 
estimates of ability and the PFSs. 


For any simulation condition, the following steps were repeated 1,000 times:

1. Simulate a set of true item parameters; simulate 1,000 true ability values (representing
1,000 examinees) uniformly from the nine above-mentioned values.

2. Use the above true item and ability parameters and the IRT model (3PLM+GPCM) to
simulate the item scores on a data set; simulate score patterns from the IRT model for
simulation conditions to compute the Type I error rate and simulate corrupted score
patterns for simulation conditions to compute the power.

3. Compute $\hat{\theta}_i$ of all the examinees from the data set using the true item parameters.

4. Compute $\zeta_1$, $\zeta_2$, $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$ for each examinee in the data set using $\hat{\theta}_i$ and the true
item parameters. Compute the p-values corresponding to these PFSs under a standard
normal null distribution assumption.
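
Steps 1 and 2 can be sketched for model-fitting score patterns in the 12-item condition as follows. This is an illustration in Python rather than the Fortran 90 programs used for the study. The function names and the random seed are hypothetical, and the second argument of the normal distributions for the location parameters is treated as an SD, which is an assumption because the text does not say whether 0.5 is an SD or a variance.

```python
import math
import random

ABILITIES = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

def draw_items(n_dich, n_poly, rng):
    """Step 1: draw true item parameters for one data set."""
    dich = [(rng.lognormvariate(0.0, 0.25), rng.gauss(0.0, 1.0),
             rng.uniform(0.05, 0.3)) for _ in range(n_dich)]
    # Assumption: 0.5 is used as an SD for the two location parameters.
    poly = [(rng.lognormvariate(0.0, 0.25),
             [rng.gauss(-1.0, 0.5), rng.gauss(1.0, 0.5)])
            for _ in range(n_poly)]
    return dich, poly

def category_probs_dich(theta, a, b, c):
    """3PLM probabilities of scores 0 and 1."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return [1.0 - p, p]

def category_probs_poly(theta, a, locs):
    """GPCM probabilities of scores 0, 1, and 2."""
    z = [0.0]
    for b_h in locs:
        z.append(z[-1] + a * (theta - b_h))
    e = [math.exp(v - max(z)) for v in z]
    return [v / sum(e) for v in e]

def simulate_pattern(theta, dich, poly, rng):
    """Step 2: draw one model-fitting score pattern."""
    probs = [category_probs_dich(theta, *item) for item in dich]
    probs += [category_probs_poly(theta, *item) for item in poly]
    return [rng.choices(range(len(p)), weights=p)[0] for p in probs]

rng = random.Random(2024)
dich, poly = draw_items(8, 4, rng)            # 12 items, 4 of them polytomous
thetas = [rng.choice(ABILITIES) for _ in range(1000)]
data = [simulate_pattern(t, dich, poly, rng) for t in thetas]
```

Steps 3 and 4 would then estimate the ability of each examinee from `data` with the true item parameters and compute the PFSs; those steps are not shown in this sketch.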
Results: Null Distributions of the PFSs


[Figure 1 panels: True Ability = 0, True Ability = 1, True Ability = 2]

Figure 1. Null Distributions of the PFSs for three values of ability for the 60-item test.


The top three panels of Figure 1 show the distributions of the observed values of
$\zeta_1^*$ (a line with dots of regular font) and $\zeta_1$ (line with bold dots) computed from all the
response patterns that fit the IRT model and were simulated using true abilities 0, 1, and 2
(mentioned in the titles of the panels) for 60-item tests. The distributions for true ability
values of -1 and -2 are very similar to those for true ability values of 1 and 2, respectively,
and are not shown here. The density of the standard normal distribution is also shown
using a solid line. The top panels of the figure show that the distributions of $\zeta_1$ and $\zeta_1^*$
are virtually indistinguishable for ability=0, but differ for non-zero values of the ability.
The top panels of the figure also show that the right tail of the distribution of $\zeta_1$ is lighter
than that of the standard normal distribution for values of the PFS between 1.0 and 2.0
(roughly); this phenomenon is expected to manifest itself as $\zeta_1$ being conservative for true
ability of 1.0 or more at the 5% level (note that 1.645, the 95th percentile of the standard
normal distribution, lies between 1.0 and 2.0). In contrast, the right tail of the distribution
of $\zeta_1^*$ follows that of the standard normal distribution more closely for values between 1.0
and 2.0, so the Type I error rates of $\zeta_1^*$ are expected to be close to the nominal level at the 5%
level. However, for values of the PFSs larger than 2 (which is close to the 98th percentile of
the standard normal distribution), the distribution of $\zeta_1^*$ has a heavier tail compared to the
standard normal distribution, so the PFS is expected to have a Type I error rate slightly
larger than the nominal level at levels around 1%.

The bottom three panels of Figure 1 show the distributions of the observed values of
$\zeta_1^*$, $\zeta_2^*$, and $-l_z^*$ (the negative of $l_z^*$ is plotted so that a large value of each PFS plotted in
these panels indicates person misfit) for true abilities of 0, 1, and 2. The standard normal
distribution is also shown as a solid line. The three distributions are quite close in the
bottom left panel, but slightly differ, especially for PFS-values larger than 2, in the bottom
middle and bottom right panels.


Results: Type I Error Rates 


The Type I error rates (and power) of the PFSs did not depend on whether the
MLE (truncated between -4.0 and 4.0), WLE, or MAP was used as the ability estimate in
the computations in this paper. Therefore, only the results for the MLEs are reported
from the simulation study. Figure 2 shows the Type I error rates of the five PFSs for all
test lengths for significance levels of 1% and 5%. The title of each panel denotes the test
length and the level of significance. In each panel, the true examinee ability is shown along
the X-axis, and the Type I error rate is shown along the Y-axis. Note that the range of
the Y-axis is the same in the three panels on the left (all corresponding to 1% significance
level) and the same in the three panels on the right (all corresponding to 5% significance
level). For each PFS, the 9 values of the Type I error rate (at true ability of -2.0, -1.5,
-1.0, ..., 1.5, and 2.0), shown using a solid circle, solid square, hollow circle, hollow square, or
solid triangle (for $\zeta_1^*$, $\zeta_2^*$, $\zeta_1$, $\zeta_2$, and $l_z^*$, respectively), are joined using a dashed line in each
panel. A horizontal solid line denotes the significance level in each panel.

Figure 2. Type I Error Rates for all test lengths at 1% and 5% levels.


Figure 2 shows that the Type I error rates of the PFSs are close to each other
around true ability of 0. However, the Type I error rates of the PFSs diverge as
the true ability becomes extreme. Figure 2 further shows that:

• The relative performance of the PFSs at 1% level is very similar to that at 5% level.

• The Type I error rates of $\zeta_1$ are smaller than those of $\zeta_1^*$ and those of $\zeta_2$ are smaller
than those of $\zeta_2^*$.

• The Type I error rates of $\zeta_2^*$ are smaller than those of $\zeta_1^*$.

• For negative true abilities, the Type I error rates of $l_z^*$ are closest to the nominal level
among the three asymptotically corrected PFSs.

• For positive true abilities, the Type I error rates of $\zeta_2^*$ are closest to the nominal level
among the three asymptotically corrected PFSs.

• The Type I error rates of $l_z^*$ often exceed the significance level substantially, especially
at 1% level and for large positive abilities. The inflation of the Type I error rates of
$l_z^*$ at small significance levels has been observed by several researchers such as Snijders
(2001), van Krimpen-Stoop and Meijer (1999), and Sinharay (2016c).

• The Type I error rates of $\zeta_1^*$ and $\zeta_2^*$ are also inflated. Compared to $l_z^*$, the extent of
Type I error inflation is slightly larger for $\zeta_1^*$ (thus, $\zeta_1^*$ has the largest Type I error rates
on average among the PFSs considered here) and slightly smaller for $\zeta_2^*$. The greatest
extent of inflation is observed at extreme true abilities.

• The extent of inflation of Type I error rates is smaller at 5% level compared to 1%
level.

• The Type I error rates of $\zeta_1$ and $\zeta_2$ are slightly larger than the nominal level around
true ability of 0, but become smaller as true ability becomes more extreme, and are
sometimes smaller than the nominal level for extreme true ability.

• For 60-item tests and 5% level, the Type I error rates of all the PFSs are smaller than
0.06 for all true abilities.


Results: Distribution for Short Tests 


The Type I error rates of $\zeta_1^*$ and $l_z^*$ were found to rise sharply in Figure 2, especially
at 1% level, for 12-item tests at true abilities of -2 and 2, respectively. Figure 3
shows the distributions of the observed values of $\zeta_1^*$, $\zeta_2^*$, and $-l_z^*$ for all the response patterns
that fit the IRT model and were simulated using true abilities of -2, -1, 0, or 2 (shown in
the title of each panel) for 12-item tests. The density of the standard normal distribution is
also shown using a solid line. The figure shows that for 12-item tests, the null distributions
of $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$ deviate from the standard normal distribution for all the true abilities
and the deviation becomes larger as the true ability becomes more extreme. It seems that
overall, the distribution of $\zeta_2^*$ is closer to the standard normal distribution than those of $\zeta_1^*$
and $l_z^*$; this is most clear in the bottom right panel, where the distributions of $\zeta_1^*$ and $l_z^*$
have much taller peaks than the standard normal density. The small peaks at values
larger than 2 for $\zeta_1^*$ in the top left panel and $-l_z^*$ in the bottom right panel show that there
are considerably more values of $\zeta_1^*$ and $-l_z^*$ larger than 2.0 in these cases than
can be expected from a standard normal variable. This phenomenon results in their large
Type I error rates in those cases (for true ability of -2 for $\zeta_1^*$ and 2 for $l_z^*$, both for 12-item
tests). Also note that the empirical distribution of $l_z^*$ for a test with six dichotomous
items, shown in Figure 2B of Meijer and Tendeiro (2012), was considerably different from a
standard normal distribution and looks very much like the bottom right panel of Figure 3 of
this paper. However, neither Figure 2B of Meijer and Tendeiro (2012) nor Figure 3 of this paper
provides any evidence against the asymptotic normality of $l_z^*$, $\zeta_1^*$, or $\zeta_2^*$; the asymptotic
normality of these PFSs holds for long tests, as noted by Meijer and Tendeiro (2012), so it
is natural that the normality did not hold for a 6-item test in Meijer and Tendeiro (2012)
and for the 12-item test in this paper.
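
The comparison of an empirical null distribution with its nominal standard normal reference can be illustrated with a small simulation. The sketch below is not the simulation of this paper: as a simplifying assumption, it uses the uncorrected statistic $l_z$ of Drasgow et al. (1985) with the true ability treated as known, dichotomous items under a two-parameter logistic model, and hypothetical item parameters drawn at random. It only shows how an empirical rejection rate at a nominal 5% level can be computed for a 12-item test.

```python
import math
import random


def two_pl_prob(theta, a, b):
    """Probability of a correct response under the 2PL model (illustrative model choice)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def lz(responses, probs):
    """Standardized log-likelihood statistic l_z for dichotomous items."""
    l0 = sum(x * math.log(p) + (1 - x) * math.log(1 - p)
             for x, p in zip(responses, probs))
    mean = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    var = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - mean) / math.sqrt(var)


def lower_tail_rate(theta, items, crit, n_rep, rng):
    """Share of model-fitting response patterns with l_z below the critical value."""
    probs = [two_pl_prob(theta, a, b) for a, b in items]
    hits = 0
    for _ in range(n_rep):
        x = [1 if rng.random() < p else 0 for p in probs]
        if lz(x, probs) < crit:
            hits += 1
    return hits / n_rep


rng = random.Random(12345)
# Hypothetical item parameters (discrimination, difficulty) for a 12-item test.
items = [(rng.uniform(0.8, 1.6), rng.uniform(-2.0, 2.0)) for _ in range(12)]
rates = {}
for theta in (-2.0, 0.0, 2.0):
    # -1.645 is the lower 5% point of the standard normal distribution.
    rates[theta] = lower_tail_rate(theta, items, crit=-1.645, n_rep=5000, rng=rng)
    print(f"theta={theta:+.1f}  empirical rate at nominal 0.05: {rates[theta]:.3f}")
```

An empirical rate that differs noticeably from 0.05 at a given ability would indicate, for that ability and test length, the kind of departure from the standard normal reference discussed above.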

Even though the distribution of each of $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$ departs from a standard normal
distribution for short tests, Figure 3 (and Figure 2 to a certain extent) shows that among
these three PFSs, the null distribution of $\zeta_2^*$ is closest to a standard normal distribution for
short tests. Therefore, these figures show some evidence that $\zeta_2^*$ may be preferred as a PFS
over $l_z^*$ and $\zeta_1^*$ for short tests.

Figure 3. Null Distributions of $\zeta_1^*$, $\zeta_2^*$, and $-l_z^*$ for 12-item Tests.
Figure 4. Power at 5% level for 12-item Tests. (Panels: Item Disclosure 1/6th, Item Disclosure 1/3rd,
Lack of Motivation 1/6th, Lack of Motivation 1/3rd.)


Results: Power 

Figures 4 to 6 show the values of power at 5% level for the different test lengths and
different types of misfit. Each figure (representing a test length) shows the power under
four types of misfit. The vertical axis in each panel of these figures ranges from 0 to 1. The
figures show that $\zeta_1^*$ is always more powerful than $\zeta_1$ and $\zeta_2^*$ is always more powerful than
$\zeta_2$. In addition, these figures show that

• Either one among $\zeta_1^*$ and $\zeta_1$ is more powerful than both of $\zeta_2^*$ and $\zeta_2$ on some occasions,
such as for $\theta < 0$ in the top left panel of Figure 4 and for $\theta > 0$ in the bottom left
panel of Figure 4

• Either one among $\zeta_2^*$ and $\zeta_2$ is more powerful than both of $\zeta_1^*$ and $\zeta_1$ on some other
occasions, such as for $-1 < \theta < 1$ in the top right panels of Figures 4 and 5

• The power of all of $\zeta_1^*$, $\zeta_2^*$, and $\zeta_2$ are very close in the two rightmost panels of Figures 5
and 6, that is, when the test is at least moderately long and when the percentage of
aberrant responses is high

• The power of $l_z^*$ is larger than all of $\zeta_1^*$, $\zeta_1$, $\zeta_2^*$, and $\zeta_2$ on a few occasions (e.g., for
ability smaller than -1 in the top right panel of Figure 4), but is smaller than that of
one of the others on most other occasions and is the smallest in several cases (such as
around $\theta = 0$ in each panel in Figure 5).

Figure 5. Power at 5% level for 30-item Tests.

Figure 6. Power at 5% level for 60-item Tests.


Discussion of the Comparative Performance of the PFSs 


The above results show that there is no single PFS that outperforms the others for
all simulation conditions. For example, the Type I error rates of $\zeta_1^*$ are often inflated, but
the power of $\zeta_1^*$ is often the largest; and the Type I error rates of $l_z^*$ are often very close
to the nominal level, but the power of $l_z^*$ is smaller than that of $\zeta_1^*$ and $\zeta_2^*$ on most
occasions. This finding supports the statement of Tendeiro and Meijer (2014, p. 257) that
different PFSs may have different sensibility to detect aberrant behavior under various
testing conditions and probably implies that one should use all of $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$ to assess
person fit for any given MFT and combine information from them (and potentially from
other information sources) to make an overall decision on person fit for any given examinee.

However, the performance of $\zeta_2^*$ seems to be the best overall by a small margin. The
Type I error rates of $\zeta_2^*$, although slightly inflated occasionally (especially at 1% level), are
the least inflated among those of $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$; also, the Type I error rates of $\zeta_2^*$ at both
1% and 5% levels, though inflated, are almost always satisfactory according to Cochran’s
criterion for robustness (Cochran, 1952) that specifies that Type I error rates below 0.06
and 0.015 are satisfactory at 5% and 1% levels, respectively. Further, Figure 3 shows that
for short tests, the distribution of $\zeta_2^*$ is closest to the theorized standard normal distribution
among the three asymptotically standardized PFSs considered in this paper. Further, the
power of $\zeta_2^*$ is often the largest, especially for the long tests, among all the PFSs considered
in this paper.
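
Cochran's criterion as stated above can be written as a simple check. The sketch below encodes only the two thresholds quoted in the text (0.06 at the 5% level and 0.015 at the 1% level); the function name and the example rates are hypothetical and are not results from this paper.

```python
# Upper bounds for a "satisfactory" Type I error rate, as quoted in the text.
COCHRAN_UPPER_BOUNDS = {0.05: 0.06, 0.01: 0.015}


def satisfactory_by_cochran(type1_rate, nominal_level):
    """Return True if an empirical Type I error rate is below Cochran's bound."""
    return type1_rate < COCHRAN_UPPER_BOUNDS[nominal_level]


# Hypothetical empirical rates, for illustration only.
print(satisfactory_by_cochran(0.055, 0.05))  # inflated, but below 0.06
print(satisfactory_by_cochran(0.020, 0.01))  # above 0.015
```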
Additional Simulations Where Item Parameters are Estimated


The above simulations were performed under the assumption that the true item
parameters are known. This assumption is reasonable in several cases such as those with
large samples for which accurate and precise estimates of item parameters are available
and is common in existing simulation studies involving PFSs (e.g., Snijders, 2001; van
Krimpen-Stoop & Meijer, 1999). However, in practice, true item parameters are never
known and item parameters have to be estimated from the data. Therefore, an additional
simulation was performed where item parameters were estimated. The simulation involved
1,000 repetitions of the following steps:

1. Simulate a set of true item parameters.

2. Simulate 1,000 true ability values (representing 1,000 examinees) for a data set
uniformly from the nine above-mentioned values.

3. Use the above true item and ability parameters and the IRT model (3PLM+GPCM)
to simulate a data set.

4. Estimate the item parameters using the marginal maximum likelihood algorithm from
the data set.

5. Compute $\hat{\theta}$ of all the examinees from the data set using the estimated item parameters.

6. Compute $\zeta_1$, $\zeta_2$, $\zeta_1^*$, $\zeta_2^*$, and $l_z^*$ for each examinee in the data set using $\hat{\theta}$ and the
estimated item parameters. Compute the p-values corresponding to these PFSs under
a standard normal null distribution assumption.
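
The p-value computation in the last step can be sketched as follows. The tail directions are an assumption inferred from the discussion of Figure 3, where $-l_z^*$ is plotted alongside the caution indices: large values are treated as indicating misfit for the caution indices and small values for $l_z^*$. The function names are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_value_upper(stat):
    """One-sided p-value when large values indicate misfit (assumed for the caution indices)."""
    return normal_sf(stat)


def p_value_lower(stat):
    """One-sided p-value when small values indicate misfit (assumed for l_z*)."""
    return normal_sf(-stat)


# A statistic is flagged at the 5% level when its p-value is below 0.05.
print(p_value_upper(2.0) < 0.05, p_value_lower(-2.0) < 0.05)
```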


As in, for example, Glas and Meijer (2003), power was computed from data sets 
that included 90% examinees whose item-scores were simulated from the IRT model 
(3PLM+GPCM) and 10% examinees whose item-scores involved one type of misfit (lack of 
motivation or item disclosure). 
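
Given p-values for a data set that mixes fitting and aberrant examinees, the empirical Type I error rate and power can be tabulated separately for the two groups. The sketch below is one plausible bookkeeping under that design, not necessarily the exact procedure used here; the function name and the toy p-values are hypothetical.

```python
def rejection_rates(p_values, is_aberrant, alpha=0.05):
    """Return (Type I error rate, power) from p-values and known misfit status.

    Type I error rate: share of model-fitting examinees flagged at level alpha.
    Power: share of aberrant examinees flagged at level alpha.
    """
    fit_flags = [p < alpha for p, ab in zip(p_values, is_aberrant) if not ab]
    aberrant_flags = [p < alpha for p, ab in zip(p_values, is_aberrant) if ab]
    type1 = sum(fit_flags) / len(fit_flags)
    power = sum(aberrant_flags) / len(aberrant_flags)
    return type1, power


# Toy example: eight fitting examinees followed by two aberrant examinees.
toy_p = [0.40, 0.03, 0.75, 0.20, 0.66, 0.51, 0.09, 0.88, 0.01, 0.30]
toy_status = [False] * 8 + [True] * 2
type1, power = rejection_rates(toy_p, toy_status)
print(type1, power)
```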

The Type I error rates of the PFSs from this set of simulations were very similar to
those in Figure 2 and are not reported here. The values of power of the PFSs from this set
of simulations are slightly smaller in general than those shown in Figures 4 to 6. Figure 7
shows the power at 5% level of the PFSs for 60-item tests when item parameters were
estimated. The values of power in Figure 7 are mostly smaller than those in Figure 6,
especially in the top left and bottom right panels. For example, while the power of $l_z^*$ for
true ability of 0 is about 0.98 in the bottom right panel of Figure 6, the corresponding value
of power is about 0.74 in the bottom right panel of Figure 7. However, the comparative
performance of the PFSs is similar in Figure 6 and Figure 7 in that $\zeta_1$ and $\zeta_2$ are less
powerful than $\zeta_1^*$ and $\zeta_2^*$, respectively, and it is difficult to pick an overall winner among
the PFSs; no PFS has the largest power in all cases and each of the asymptotically
standardized PFSs has the largest power in some cases. The difference in power between $\zeta_1$
and $\zeta_1^*$ (and also between $\zeta_2$ and $\zeta_2^*$) is smaller in Figure 7 than in Figure 6. Overall, the
estimation of item parameters does not affect the comparative performance of the PFSs,
especially of $l_z^*$, $\zeta_1^*$, and $\zeta_2^*$.

Figure 7. Power at 5% level for 60-item tests when item parameters were estimated.


Application to Real Data 
Application 1: A Test with Only Polytomous Items 


Let us consider the data set from the NEO Personality Inventory that was analyzed by
Glas and Dagohoy (2007) and Sinharay (2016c).³ The NEO Personality Inventory is a
personality test designed to provide a general description of normal personality that is
relevant to clinical, counseling, and educational situations. The inventory consists of five
broad domains. Each domain is measured by 48 items, each of which is rated on a five-point
scale. Data from 1,168 individuals on the neuroticism domain were analyzed in Glas and
Dagohoy (2007), who split the 48 items on the domain into three subtests. The GPCM was
fitted to each subtest separately and $\zeta_1$, $\zeta_1^*$, $\zeta_2$, $\zeta_2^*$, and $l_z^*$ were computed for the examinees.
The percentages of individuals for which the PFSs were significant (under the assumption of
a standard normal null distribution) at 5% significance level for the three subtests are
provided in Table 2.


Table 2. The percent of PFSs significant at 5% level for the first application.

Subtest   $\zeta_1$   $\zeta_1^*$   $\zeta_2$   $\zeta_2^*$   $l_z^*$
1         7.6         8.2           7.4         7.5           11.4
2         To          Tor           6           Ter           6
3         6.6         7.2           7.8         8.0           11.5


³ The author is grateful to Cees A. W. Glas for generously sharing the data set.




The percent is largest for $l_z^*$ for each subtest. However, the true person-misfit status is
unknown for the data set, so it is impossible to state if the larger percent for $l_z^*$ indicates
its larger power or larger Type I error rate. The percent significant for each of $\zeta_1$ and $\zeta_2$ is
slightly smaller than that of the corresponding asymptotically standardized PFS for each
subtest.


Application 2: A Mixed-format Test 


The PFSs were computed using data from one form of a licensure examination. The 
data set was analyzed in several chapters of Cizek and Wollack (2017). The test form 
includes 170 operational items that are dichotomously scored. Item scores on the form 
were available for 1,644 examinees. The licensure organization (that administers the 
examination) flagged, using a variety of statistical analyses and an investigative process
that brought in other information, 48 individuals on the form as possible cheaters. The 
examinees flagged by the licensure organization can be considered truly aberrant for all 
practical purposes because of the rigorous nature of the flagging process of the organization. 

Because the test included only dichotomous items, an (artificial) MFT was created 
by pooling 20 pairs of randomly chosen items into 20 3-category polytomous items and 
combining them with the 130 remaining items in the data set. Let’s consider the polytomous 
item arising from a given item pair; an examinee is assigned a score of 0 if he/she answered 
both items in the pair incorrectly, is assigned a score of 1 if he/she answered exactly one 
item in the pair correctly, and is assigned a score of 2 if he/she answered both items in the 
pair correctly. The Rasch model is operationally used in the assessment; the model fitted 
to the resulting data set (with 130 dichotomous items and 20 three-category polytomous 
items) was a combination of the Rasch model for the dichotomous items and the partial 
credit model (Masters, 1982) for the polytomous items. The difficulty parameters were
estimated from the resulting data set and were used in the PFA. The MLE (truncated
to lie between -4.0 and 4.0) was used as the estimate of the examinee ability.
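
The pooling of item pairs described above can be sketched as follows. This is a minimal illustration assuming the item scores are stored as one list of 0/1 values per examinee; the function name, the random seeds, and the toy data are hypothetical, and the column order of the pooled data is an arbitrary choice.

```python
import random


def pool_item_pairs(scores, n_pairs, rng):
    """Pool randomly chosen pairs of dichotomous items into 3-category items.

    scores: list of per-examinee lists of 0/1 item scores.
    Returns the pooled data (remaining dichotomous items first, then the
    polytomous items), the chosen pairs, and the indices of the remaining items.
    """
    n_items = len(scores[0])
    chosen = rng.sample(range(n_items), 2 * n_pairs)
    pairs = [(chosen[2 * k], chosen[2 * k + 1]) for k in range(n_pairs)]
    paired = set(chosen)
    remaining = [j for j in range(n_items) if j not in paired]
    pooled = []
    for row in scores:
        # Score 0, 1, or 2: the number of items in the pair answered correctly.
        poly = [row[i] + row[j] for i, j in pairs]
        dich = [row[j] for j in remaining]
        pooled.append(dich + poly)
    return pooled, pairs, remaining


# Toy data: 5 examinees and 170 dichotomous items, mirroring the test length above.
data_rng = random.Random(0)
toy_scores = [[data_rng.randint(0, 1) for _ in range(170)] for _ in range(5)]
pooled, pairs, remaining = pool_item_pairs(toy_scores, 20, random.Random(1))
print(len(remaining), len(pairs), len(pooled[0]))
```

With 170 items and 20 pairs, the pooled data have 130 dichotomous and 20 three-category items, and each examinee's total score is unchanged by the pooling.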

The proportions of examinees for which $\zeta_1$, $\zeta_1^*$, $\zeta_2$, $\zeta_2^*$, and $l_z^*$ were significant at 5%
significance level are provided in Table 3. The first row of numbers provides the proportions
among all examinees. The second row of numbers provides the proportions among the 48
examinees who were flagged by the licensure organization; thus, for example, the proportion
of 0.17 for $\zeta_1^*$ in the second row indicates that among the 48 examinees flagged by the
licensure organization, $\zeta_1^*$ was significant for eight examinees (note that 8/48 ≈ 0.17).

Table 3. The proportion of PFSs significant at 5% level for the second application.

Examinees   $\zeta_1$   $\zeta_1^*$   $\zeta_2$   $\zeta_2^*$   $l_z^*$
All         0.08        0.09          0.08        0.08          0.11
Flagged     0.15        0.17          0.15        0.17          0.21

Table 3 shows that the results for $\zeta_1$, $\zeta_2$, $\zeta_1^*$, and $\zeta_2^*$ are very similar, although the
proportion significant among the flagged examinees for each of $\zeta_1$ and $\zeta_2$ is slightly smaller
than or equal to that of the corresponding asymptotically standardized PFS. Table 3 also
shows that the proportion significant for each PFS is about twice as large among the examinees
flagged by the licensure organization as among all examinees; this result provides
some evidence that the PFSs provide useful information.


Conclusions 


This paper suggested four new PFSs: the extensions to MFTs of $\zeta_1$ and $\zeta_2$ (Tatsuoka,
1984) and their asymptotically corrected versions. The PFSs also apply to a test including
only polytomous items. The asymptotically corrected versions were theoretically proved
to follow the standard normal distribution asymptotically under no person misfit. In a
simulation study, the Type I error rates of the corrected versions were found to be close to the
nominal level at 5% level of significance, but slightly inflated at 1% level of significance.
The corrected versions have slightly larger power than the corresponding uncorrected
versions. In a real-data example where some score patterns were known to be aberrant,
the PFSs provided useful information by flagging aberrant examinees at a larger rate than
others. All these properties are in consonance with those for the existing asymptotically
corrected PFSs (Magis, Beland, & Raiche, 2014; Snijders, 2001; Sinharay, 2016b, 2016c).
Given the increasing importance of polytomous items (e.g., Darling-Hammond & Adamson,
2010) and PFA, the suggested PFSs promise to help practitioners.

The suggested PFSs are appropriate as PFSs when an investigator wants to test 
against an unspecified general alternative and may not be the most appropriate PFS for all 
applications of PFSs. For example, if one needs a PFS for a computerized-adaptive test, a 
PFS such as that of van Krimpen-Stoop and Meijer (2002) will be the most appropriate. If 
the anticipated model violation is more specific, a PFS such as the Lagrangian-multiplier 
statistic (Glas & Dagohoy, 2007) may be more powerful. 

The suggested PFSs were used for PFA in this paper. However, they can be used for
other purposes as well. For example, Tatsuoka (1984) used $\zeta_1$ and $\zeta_2$ for dichotomous items
for diagnosing student misconceptions and all the suggested PFSs can be used for those
purposes as well; $\zeta_1^*$ and $\zeta_2^*$ for MFTs, which were proved to follow an asymptotic standard
normal null distribution, would help detect student misconceptions accurately.

There are several limitations of this paper and, consequently, several additional topics
that can be investigated further. First, the Type I error rates of $\zeta_1^*$ and $\zeta_2^*$ were found to be
larger than the nominal level for significance levels around 1%. Improvement on this
Type I error inflation is a possible future research topic. It is possible to perform Monte
Carlo simulations, as was performed by Sinharay (2016a) and van Krimpen-Stoop and
Meijer (1999) for dichotomous items, to obtain an empirical null distribution of the PFSs.
Second, Figure 2 shows that for short tests, the Type I error rates of $\zeta_1^*$ rapidly increase as
the true ability becomes smaller than -1.5 and the Type I error rates of $l_z^*$ rapidly increase
as the true ability becomes larger than 1.5. It would be helpful to be able to explain this
phenomenon. Third, it would be interesting to explore ability estimates other than the
MLE, WLE, and MAP that are considered in this paper; robust ability estimates would be
prime candidates because they would be less influenced by unusual responses, which may
lead to larger power of the resulting PFSs. Sinharay (2016d) used robust ability estimates
with asymptotically corrected PFSs for dichotomous items. Fourth, it may be of interest to
examine the Type I error rate and power of the PFSs for more simulated and real data
sets. For example, factors such as spread of item difficulties and average item
discrimination often influence the power of PFSs in simulations (e.g., van Krimpen-Stoop
& Meijer, 1999); however, these factors were not manipulated in the simulations here and
can be manipulated in future research. Finally, further research such as Conijn, Emons,
and Sijtsma (2014), Meijer, Niessen, and Tendeiro (2015), and Meijer and Tendeiro (2014)
should explore how PFA can be used in practice, for example, in high-stakes educational
tests.